64 research outputs found

    Active Data: A Model for Representing and Programming the Life Cycle of Distributed Data

    National audience. As science generates and processes ever larger and more dynamic data sets, a growing number of scientists face challenges in exploiting them. Data management for data-intensive scientific applications requires support for highly complex life cycles, coordination across many sites, fault tolerance, and scalability to tens of sites holding several petabytes of data. In this paper, we propose a model to formally represent the life cycles of data-processing applications and a programming model to react to them dynamically. We discuss a prototype implementation and present several application case studies that demonstrate the relevance of our approach.
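    The abstract names a formal life-cycle model and a reactive programming model without further detail. Purely as a rough illustration (the paper's actual formalism is not given here; the states and operation names below are assumptions), the following Python sketch shows one way a data life cycle can be represented as a set of states with explicit legal transitions:

    ```python
    # Illustrative sketch only, not the paper's formal model: a data item's
    # life cycle as a small labelled transition system.
    from dataclasses import dataclass, field

    @dataclass
    class LifeCycle:
        state: str = "created"
        # Legal transitions: current state -> {operation: next state}
        transitions: dict = field(default_factory=lambda: {
            "created":     {"transfer": "transferred", "replicate": "replicated"},
            "transferred": {"archive": "archived", "delete": "deleted"},
            "replicated":  {"delete": "deleted"},
            "archived":    {"delete": "deleted"},
        })

        def apply(self, operation: str) -> str:
            legal = self.transitions.get(self.state, {})
            if operation not in legal:
                raise ValueError(f"{operation!r} is illegal in state {self.state!r}")
            self.state = legal[operation]
            return self.state

    lc = LifeCycle()
    lc.apply("transfer")  # created -> transferred
    lc.apply("archive")   # transferred -> archived
    ```

    A manager built on such a representation can both validate operations reported by the systems holding the data and expose each item's current state, which is what the programming model then reacts to.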

    Active Data: A Data-Centric Approach to Data Life-Cycle Management

    International audience. Data-intensive science offers new opportunities for innovation and discoveries, provided that large datasets can be handled efficiently. Data management for data-intensive science applications is challenging, requiring support for complex data life cycles, coordination across multiple sites, fault tolerance, and scalability to tens of sites and petabytes of data. In this paper, we argue that data management for data-intensive science applications requires a fundamentally different approach from the current ad-hoc, task-centric one. We propose Active Data, a fundamentally novel paradigm for data life cycle management. Active Data follows two principles: it is data-centric and event-driven. We report on the Active Data programming model and its preliminary implementation, and discuss the benefits and limitations of the approach on recognized challenging data-intensive science use cases.

    Energy-Aware Massively Distributed Cloud Facilities: The DISCOVERY Initiative

    International audience. Instead of the current trend of building larger and larger data centers (DCs) in a few strategic locations, the DISCOVERY initiative proposes to leverage any network point of presence (PoP, i.e., a small or medium-sized network center) available through the Internet. The key idea is to demonstrate a widely distributed Cloud platform that can better match the geographical dispersal of users and of renewable energy sources. This involves radical changes in the way resources are managed, but leveraging computing resources close to the end users will enable the delivery of a new generation of highly efficient and sustainable Utility Computing (UC) platforms, thus providing a strong alternative to the current Cloud model based on mega DCs (i.e., DCs composed of tens of thousands of resources). This poster presents the DISCOVERY initiative's efforts toward achieving energy-aware, massively distributed cloud facilities. To satisfy the escalating demand for Cloud Computing (CC) resources while realizing economies of scale, the production of computing resources is concentrated in mega DCs of ever-increasing size, where the number of physical resources that one DC can host is limited by the capacity of its energy supply and its cooling system. To meet these critical needs in terms of energy supply and cooling, the current trend is toward building DCs in regions with abundant and affordable electricity supplies, or in regions close to the polar circle to leverage free cooling techniques [1]. However, concentrating mega DCs in only a few attractive places raises several issues. First, a disaster in these areas would be dramatic for the IT services the DCs host, as connectivity to CC resources would not be guaranteed. Second, in addition to jurisdiction concerns, hosting computing resources in a few locations leads to unnecessary network overhead to reach each DC. Such overhead can prevent the adoption of the UC paradigm by several kinds of applications, such as mobile computing or big data applications.

    Active Data: A Programming Model to Manage Data Life Cycle Across Heterogeneous Systems and Infrastructures

    The Big Data challenge consists in managing, storing, analyzing, and visualizing huge and ever-growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key point is to handle the complexity of the data life cycle, i.e., the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span a large variety of devices and e-infrastructures, which implies that many systems are involved in data management and processing. We propose Active Data, a programming model to automate and improve the expressiveness of data management applications. We first define the concept of data life cycle and introduce a formal model that exposes data life cycles across heterogeneous systems and infrastructures. The Active Data programming model allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happens to any data item. We implement and evaluate the model with four use cases: a storage cache for Amazon S3, a cooperative sensor network, an incremental implementation of the MapReduce programming model, and automated data provenance tracking across heterogeneous systems. Altogether, these scenarios illustrate the adequacy of the model for programming applications that manage distributed and dynamic data sets. We also show that applications that do not leverage the data life cycle can still benefit from Active Data to improve their performance.
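    The abstract describes the programming model only at this high level. As an illustrative sketch (the registration API, handler signatures, and event names below are assumptions, not Active Data's actual interface), the following Python fragment shows the general shape of an event-driven life-cycle model in which user-provided routines run when life cycle events occur:

    ```python
    # Hypothetical sketch of an event-driven life-cycle programming model.
    # Names and signatures are illustrative, not Active Data's real API.
    from collections import defaultdict

    class LifeCycleBus:
        def __init__(self):
            self._handlers = defaultdict(list)

        def on(self, event: str, handler):
            """Register a user routine to run when `event` happens to any data."""
            self._handlers[event].append(handler)

        def publish(self, event: str, data_id: str):
            """Called by storage/transfer systems when an operation occurs."""
            for handler in self._handlers[event]:
                handler(data_id)

    bus = LifeCycleBus()

    # User routines, e.g. provenance recording and cache maintenance
    bus.on("replication", lambda d: print(f"record provenance for {d}"))
    bus.on("deletion",    lambda d: print(f"evict {d} from the cache"))

    # Systems report events as they happen to a data item
    bus.publish("replication", "dataset-42/chunk-7")
    bus.publish("deletion", "dataset-42/chunk-7")
    ```

    This is the sense in which the model is data-centric: application logic is attached to events in the data's life cycle rather than to a task pipeline.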

    ENOS: a Holistic Framework for Conducting Scientific Evaluations of OpenStack

    By massively adopting OpenStack for operating small to large private and public clouds, the industry has made it one of the largest running software projects. Driven by an incredibly vibrant community, OpenStack has now grown larger than the Linux kernel. However, with success comes increased complexity; facing technical and scientific challenges, developers are in great difficulty when testing the impact of individual changes on the performance of such a large codebase, which will likely slow down the evolution of OpenStack. In light of the difficulties the OpenStack community is facing, we claim that it is time for our scientific community to join the effort and get involved in the development and evolution of OpenStack, as was once done for Linux. However, diving into complex software such as OpenStack is tedious: reliable tools are necessary to ease the efforts of our community and make science as collaborative as possible. In this spirit, we developed ENOS, an integrated framework that relies on container technologies for deploying and evaluating OpenStack on any testbed. ENOS allows researchers to easily express different configurations, enabling fine-grained investigations of OpenStack services. ENOS collects performance metrics at runtime and stores them for post-mortem analysis and sharing. The relevance of the ENOS approach to reproducible research is illustrated by evaluating different OpenStack scenarios on the Grid'5000 testbed.
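    The abstract outlines ENOS's workflow (deploy a containerized OpenStack on a testbed, run scenarios, collect metrics for post-mortem analysis) without showing an interface. The sketch below is purely illustrative of that deploy/run/collect loop; the function names and configuration fields are hypothetical and do not correspond to ENOS's actual API or file formats:

    ```python
    # Hypothetical driver mirroring the workflow described in the abstract.
    # None of these names come from ENOS itself.
    def deploy_openstack(config: dict) -> None:
        """Deploy a containerized OpenStack matching `config` on the testbed."""
        ...

    def run_scenario(name: str) -> dict:
        """Run one evaluation scenario and return its runtime metrics."""
        return {}  # placeholder for collected metrics

    def archive_metrics(metrics: dict, path: str) -> None:
        """Store collected metrics for post-mortem analysis and sharing."""
        ...

    config = {
        "testbed": "grid5000",           # target platform
        "compute_nodes": 4,              # fine-grained topology choice
        "services": ["nova", "neutron"], # OpenStack services under study
    }

    deploy_openstack(config)
    for scenario in ["boot-100-vms", "concurrent-api-calls"]:
        archive_metrics(run_scenario(scenario), f"results/{scenario}.json")
    ```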

    D³-MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets

    International audience. Since its introduction in 2004 by Google, MapReduce has become the programming model of choice for processing large data sets. Although MapReduce was originally developed for use by web enterprises in large data centers, this technique has gained a lot of attention from the scientific community for its applicability to large parallel data analysis (including geographic, high-energy physics, and genomics workloads). So far, MapReduce has mostly been designed for batch processing of bulk data. The ambition of D³-MapReduce is to extend the MapReduce programming model and propose an efficient implementation of this model to: i) cope with distributed data sets, i.e., sets that span multiple distributed infrastructures or are stored on networks of loosely connected devices; ii) cope with dynamic data sets, i.e., sets that change over time or may be incomplete or only partially available. In this paper, we draw the path towards this ambitious goal. Our approach leverages the data life cycle as a key concept to provide MapReduce for distributed and dynamic data sets on heterogeneous and distributed infrastructures. We first report on our attempts at implementing the MapReduce programming model for Hybrid Distributed Computing Infrastructures (Hybrid DCIs). We present the architecture of the prototype, based on BitDew, a middleware for large-scale data management, and Active Data, a programming model for data life cycle management. Second, we outline the challenges in terms of methodology and present our approaches based on simulation and emulation on the Grid'5000 experimental testbed. We conduct performance evaluations and compare our prototype with Hadoop, the industry-reference MapReduce implementation. We present our work in progress on dynamic data sets, which has led us to implement an incremental MapReduce framework. Finally, we discuss our achievements and outline the challenges that remain to be addressed before obtaining a complete D³-MapReduce environment.
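    The incremental MapReduce framework mentioned near the end of the abstract can be illustrated with a small example. The sketch below is not the D³-MapReduce prototype (which builds on BitDew and Active Data); it only shows, under simplified single-process assumptions, the core idea of incremental processing: cache map outputs per key so that when new records arrive, only the keys they touch are re-reduced.

    ```python
    # Simplified single-process illustration of incremental MapReduce:
    # new records trigger re-reduction only for the keys they affect.
    from collections import defaultdict

    def map_wordcount(record: str):
        for word in record.split():
            yield word, 1

    def reduce_wordcount(values) -> int:
        return sum(values)

    class IncrementalWordCount:
        def __init__(self):
            self.partials = defaultdict(list)  # key -> cached map outputs
            self.results = {}                  # key -> reduced value

        def add(self, records):
            touched = set()
            for record in records:
                for key, value in map_wordcount(record):
                    self.partials[key].append(value)
                    touched.add(key)
            for key in touched:  # re-reduce only the affected keys
                self.results[key] = reduce_wordcount(self.partials[key])

    job = IncrementalWordCount()
    job.add(["the quick brown fox"])
    job.add(["the lazy dog"])    # only "the", "lazy", "dog" are re-reduced
    print(job.results["the"])    # -> 2
    ```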

    Big Data Pipelines on the Computing Continuum: Tapping the Dark Data

    The computing continuum enables new opportunities for managing big data pipelines, in particular the efficient management of heterogeneous and untrustworthy resources. We discuss the life cycle of big data pipelines on the computing continuum and its associated challenges, and we outline a future research agenda in this area.